Detect overlapping and hierarchical community structure in networks






Huawei Shen, Xueqi Cheng, Kai Cai, and Mao-Bin Hu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P.R. China
Graduate University of Chinese Academy of Sciences, Beijing, P.R. China
School of Engineering Science, University of Science and Technology of China, Hefei 230026, P.R. China

(Dated: November 3, 2008) 

Clustering and community structure is crucial for many network systems and the related dynamic 
processes. It has been shown that communities are usually overlapping and hierarchical. However, 
previous methods investigate these two properties of community structure separately. This pa- 
per proposes an algorithm (EAGLE) to detect both the overlapping and hierarchical properties of 
complex community structure together. This algorithm deals with the set of maximal cliques and 
adopts an agglomerative framework. The quality function of modularity is extended to evaluate the 
goodness of a cover. The examples of application to real world networks give excellent results. 

PACS numbers: 89.75.Hc, 05.10.-a, 87.23.Ge, 89.20.Hh



I. INTRODUCTION 

Many complex systems in nature and society can be de- 
scribed in terms of networks or graphs. Examples include 
the Internet, the world-wide-web, social and biological
systems of various kinds, and many others 0,|2|,@1- In the 
past decade, the theory of complex networks has attracted
much attention. Complex networks are usually characterized
by several distinctive properties: power-law degree
distribution, short path length, clustering and community
structure. Characterizing these properties is important
because the dynamics of a complex system is determined
by the interaction of many components, and the topological
properties of the network affect the dynamics in a
very fundamental way. Therefore, an efficient and sound
approach that can capture the topological properties of
networks is needed.

Identifying the community structure is crucial to un- 
derstand the structural and functional properties of the 
networks [1, [H, . Many methods have been proposed to 
identify the community structure of complex networks 
0, B i, [iO, EH, m. One can refer to [13] for reviews.
These methods can be roughly classified into two cate- 
gories in terms of their results, i.e., to form a partition 
or a cover of the network. 

The first kind of methods produces a partition, i.e., each
vertex belongs to one and only one community and is
regarded as equally important. Different from the classical
graph-partition problem, the number of communities
and the size of each community are not known in advance.
Newman et al. proposed a quality function
Q, namely modularity, to evaluate the goodness of a
partition [9]. A high value of Q indicates a significant
community structure. Several community detection
methods have been proposed that optimize modularity
[ll|, [13, [lH . Generally, methods of this kind are
suitable for understanding the entire structure of networks,
especially networks of small size. Recently,
some authors [IX, J-8] have pointed out that the optimization
of modularity has a fundamental drawback, i.e., the
existence of a resolution limit.

The second kind of methods aim to discover the vertex 
sets (i.e. communities) with a high density of edges. In 
this case, overlapping is allowed, that is, some vertices 
may belong to more than one community. Meanwhile, 
some vertices may be neglected as subordinate vertices. 
Therefore, these methods result in an incomplete cover 
of the network. Numerous methods have been proposed, 
based on k-clique [8], k-dense [25] or other patterns. Un-
fortunately, there is no commonly accepted standard to
evaluate the goodness of a cover up to now. Compared to 
the partition methods, this kind of methods are appro- 
priate to find the cohesive regions in large-scale networks. 

In real networks, communities are usually overlapping 
and hierarchical [1, IH, [lO, |2l| . Overlapping means that 
some vertices may belong to more than one community. 
Hierarchical means that communities may be further di- 
vided into sub-communities. The two kinds of existing 
methods, as mentioned above, investigate these two phe- 
nomena separately. The first kind of methods can be 
used to explore the hierarchical community structure;
however, they are unable to deal with overlaps between
communities. The second kind of methods can uncover
overlapping community structure of networks, but they
are incapable of finding the hierarchy of communities.
Recently, several authors have begun to detect the hierarchical
and overlapping community structure [22].

In this paper, a new algorithm EAGLE (agglomera- 
tivE hierarchicAl clusterinG based on maximaL cliquE) 
is presented to uncover both hierarchical and overlapping 
community structure of networks. This algorithm deals 
with the set of maximal cliques and adopts an agglom- 
erative framework. The effectiveness is demonstrated by 
applications to two real-world networks, namely the word
association network and the scientific collaboration net- 
work. 

In Fig. 1 we use a schematic network to illustrate what
EAGLE can do and compare it with the two kinds of existing
methods. Fig. 1(a) depicts the schematic network.
We construct this network according to the schematic
network in [8], which has overlapped community structure.
To construct the hierarchy of the overlapped communities,
we remove the edge connecting vertices 9 and
13 and add two edges, one connecting 10 and 15 and the
other one connecting 10 and 13. Fig. 1(b) shows the community
structure found by Newman's fast algorithm [11].
Three communities are found when applying the algorithm
to the schematic network. The hierarchy of communities
can be revealed by applying the algorithm to each
community further. For example, one of the three communities
is divided into two sub-communities. Overlaps
between communities are not allowed. Fig. 1(c) demonstrates
the overlapping community structure found by the
k-clique algorithm [8]. Unfortunately, this algorithm cannot
reveal the hierarchy of communities. Fig. 1(d) shows
the hierarchical and overlapping community structure
found by our algorithm. EAGLE provides a possible way
to investigate a more complete picture of the community
structure.

FIG. 1: Comparison of community structure found by different algorithms. Different communities are rendered in different
colors. Edges between communities are colored in light gray. Overlapping regions between communities are emphasized in
red. a) The schematic network. b) The hierarchical community structure found by Newman's fast algorithm. This algorithm
is chosen as a representative of the first kind of algorithms. c) The overlapping community structure found by the k-clique
algorithm as a representative of the second kind of algorithms. d) The hierarchical and overlapping community structure found
by the algorithm EAGLE.



II. THE ALGORITHM: EAGLE 

A community can be regarded as a vertex set within 
which the vertices are more likely connected to each other 
than to the rest of the network. This indicates that a 
community usually has relatively high link-density. Gen- 
erally, the link-density of a clique is highest among all 
kinds of vertex subsets of a network. A densely linked com-
munity usually contains a large clique, which could be
regarded as the core of the community. Based on this 
observation, the algorithm EAGLE is proposed as an ag- 
glomerative hierarchical clustering algorithm to investi- 
gate the community structure. Different from traditional 
agglomerative algorithms [11], our algorithm deals with 
the set of maximal cliques rather than the set of sole 
vertices. 

A maximal clique is a clique which is not a subset of
any other clique. In the algorithm EAGLE, we first need to
find all the maximal cliques in the network.
This can be done by many efficient parallel algorithms.
Here we choose the well-known Bron-Kerbosch algorithm
(2^ for its simplicity in implementation. Note that not
all maximal cliques are taken into account. A maximal
clique whose vertices all belong to other, larger maximal
cliques is called a subordinate maximal clique. For
example, in Fig. 1, vertices 4 and 23 form a subordinate
maximal clique, because vertex 4 belongs to the larger
maximal clique {1, 2, 3, 4, 5, 6} and vertex 23 also belongs to
other larger maximal cliques, including {18, 20, 21, 23},
{18, 20, 22, 23} and {18, 19, 22, 23}. Subordinate maximal
cliques may mislead our algorithm and should be
discarded. Most subordinate maximal cliques have small
sizes. Thus, we can discard them by setting a threshold
k and neglecting all the maximal cliques with size
smaller than k. This simple tactic may also discard some
non-subordinate maximal cliques. The higher the value
of k is, the more non-subordinate maximal cliques are
discarded by mistake. On the other hand, the smaller
the value of k is, the more subordinate maximal cliques
remain. In real-world networks, the threshold k
typically takes a value between 3 and 6. For the network
in Fig. 1, both 3 and 4 are appropriate threshold values.
For the networks used in Sec. III, 4 has been shown
to be an appropriate threshold [8]. After neglecting the
maximal cliques with size smaller than the threshold
k, some vertices do not belong to any remaining maximal
clique. We call these vertices subordinate vertices.
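
The clique-filtering step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the adjacency-dictionary representation and the names `maximal_cliques` and `initial_communities` are hypothetical, and the basic Bron-Kerbosch recursion is used without pivoting.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[int, Set[int]]  # vertex -> set of neighbours (undirected graph)


def maximal_cliques(adj: Graph) -> List[Set[int]]:
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques: List[Set[int]] = []

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> None:
        # r: current clique, p: candidate vertices, x: already processed vertices
        if not p and not x:
            cliques.append(set(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(adj), set())
    return cliques


def initial_communities(adj: Graph, k: int) -> Tuple[List[Set[int]], List[int]]:
    """Keep maximal cliques of size >= k; uncovered vertices are subordinate."""
    kept = [c for c in maximal_cliques(adj) if len(c) >= k]
    covered: Set[int] = set().union(*kept) if kept else set()
    subordinate = sorted(set(adj) - covered)
    return kept, subordinate
```

For a triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2 and k = 3, the sketch keeps the triangle and reports vertex 3 as a subordinate vertex.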

Our algorithm has two stages. In the first stage, a
dendrogram is generated. In the second stage, we choose 
an appropriate cut which breaks the dendrogram into 
communities. The first stage of the algorithm EAGLE 
can be described as follows: 

1. Find out all maximal cliques in the network. Ne- 
glect subordinate maximal cliques. The remainders 
are taken as the initial communities. Each subor- 
dinate vertex is also taken as an initial community 
comprising the sole vertex. Calculate the similarity 
between each pair of communities. 

2. Select the pair of communities with the maximum 
similarity, incorporate them into a new one and cal- 
culate the similarity between the new community 
and other communities. 

3. Repeat step 2 until only one community remains. 

In the algorithm, the similarity M between two communities
$C_1$ and $C_2$ is defined as

$$ M = \frac{1}{2m} \sum_{v \in C_1,\, w \in C_2,\, v \neq w} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \qquad (1) $$

Here, $A_{vw}$ is the element of the adjacency matrix of the
network (we only consider undirected, unweighted networks
in this paper). It takes the value 1 if there is an
edge between vertex v and vertex w and 0 otherwise,
$m = \frac{1}{2} \sum_{vw} A_{vw}$ is the total number of edges in the network,
and $k_v$ is the degree of v.
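
As an illustration of Eq. (1), the similarity can be evaluated directly from an adjacency dictionary. The following is only a sketch under assumptions: the function name `similarity` and the graph representation are hypothetical, and no attempt is made at efficiency.

```python
from typing import Dict, Set

Graph = Dict[int, Set[int]]  # vertex -> set of neighbours (undirected, unweighted)


def similarity(adj: Graph, c1: Set[int], c2: Set[int]) -> float:
    """M = 1/(2m) * sum over v in C1, w in C2, v != w of [A_vw - k_v k_w / (2m)]."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees equals 2m
    total = 0.0
    for v in c1:
        for w in c2:
            if v == w:
                continue  # shared vertices are skipped, as required by v != w
            a_vw = 1.0 if w in adj[v] else 0.0
            total += a_vw - len(adj[v]) * len(adj[w]) / two_m
    return total / two_m
```

For two triangles sharing an edge the value is positive, while for two triangles joined by a single edge it is negative, which matches the intuition that overlapping cliques should be merged first.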
[Figure 2: dendrogram with the curve of EQ above it. Leaf labels, in the order shown:
{18, 19, 22, 23}; {18, 20, 22, 23}; {18, 20, 21, 23}; {1, 2, 3, 4, 5, 6};
{8, 11, 12, 13}; {7, 8, 12, 13}; {7, 8, 10, 13}; {7, 8, 9, 10}; {3, 7, 8, 13};
{11, 12, 16, 17}; {12, 14, 16, 17}; {10, 14, 15, 16, 17}]

FIG. 2: Illustration of the process of EAGLE when applied
to the schematic network in Fig. 1. The bottom part is a dendrogram.
The leaf nodes correspond to the non-subordinate
maximal cliques. The label of each leaf node shows the vertices
belonging to it. The red vertical dashed line is a cut
through the dendrogram and it gives the best cover of the
network. The top part of the figure is a graph which illustrates
the curve of EQ corresponding to each cover of the
network. The threshold k is set to be 4.



Similar to the fast algorithm in [11], the process of our
algorithm corresponds to a dendrogram, which shows the
order of the amalgamations. Any cut through the dendrogram
produces a cover of the network. As an illustration,
Fig. 2 shows the dendrogram generated by our algorithm
when applied to the network in Fig. 1.
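
The first stage can be sketched as a simple merge loop that records the cover obtained after each amalgamation. This is a brute-force illustration under assumptions, not the authors' implementation: the names `similarity` and `agglomerate` are hypothetical, all pairs are re-evaluated in every round (the neighbor-restricted bookkeeping discussed in the complexity analysis is omitted), and the initial communities are passed in as frozensets.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Graph = Dict[int, Set[int]]   # vertex -> set of neighbours (undirected)
Cover = List[FrozenSet[int]]  # list of possibly overlapping communities


def similarity(adj: Graph, c1: FrozenSet[int], c2: FrozenSet[int], two_m: int) -> float:
    """Similarity M of Eq. (1) between two communities."""
    total = 0.0
    for v in c1:
        for w in c2:
            if v != w:
                a_vw = 1.0 if w in adj[v] else 0.0
                total += a_vw - len(adj[v]) * len(adj[w]) / two_m
    return total / two_m


def agglomerate(adj: Graph, initial: Cover) -> List[Cover]:
    """Merge the most similar pair until one community remains.

    Returns every intermediate cover, i.e. the levels of the dendrogram.
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees equals 2m
    current = list(initial)
    covers = [list(current)]
    while len(current) > 1:
        # Brute force: evaluate every pair of current communities.
        i, j = max(
            combinations(range(len(current)), 2),
            key=lambda ij: similarity(adj, current[ij[0]], current[ij[1]], two_m),
        )
        merged = current[i] | current[j]
        current = [c for idx, c in enumerate(current) if idx not in (i, j)]
        current.append(merged)
        covers.append(list(current))
    return covers
```

Each element of the returned list corresponds to one possible cut through the dendrogram.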

The task of the second stage of the algorithm EAGLE
is to cut the dendrogram. To determine the place of the
cut, a measurement is required to judge the quality of a
cover. In [25], an extension of modularity is proposed to
evaluate the goodness of overlapped community decomposition.
In this paper, we propose another extension
of modularity, EQ. As shown in Fig. 2, the cut gives the
best cover with the maximum value of EQ. Given a cover
of the network, let $O_v$ be the number of communities to
which vertex v belongs. The extended modularity is defined as

$$ EQ = \frac{1}{2m} \sum_{i} \sum_{v \in C_i,\, w \in C_i} \frac{1}{O_v O_w} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \qquad (2) $$

Note that EQ reduces to Q in [9] when each vertex belongs
to only one community (Readers can refer to
for details), and EQ is equal to 0 when all nodes belong
to the same community. In addition, it will be shown
later in Sec. III that a high value of EQ indicates a significant
overlapping community structure.
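
A direct evaluation of Eq. (2), together with the selection of the cut that maximizes EQ, can be sketched as follows. The names `extended_modularity` and `best_cover` and the adjacency-dictionary representation are assumptions made for illustration; the sketch recomputes EQ from scratch for each candidate cover rather than updating it incrementally.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Sequence, Set

Graph = Dict[int, Set[int]]   # vertex -> set of neighbours (undirected)
Cover = List[FrozenSet[int]]  # list of possibly overlapping communities


def extended_modularity(adj: Graph, cover: Cover) -> float:
    """EQ = 1/(2m) * sum_i sum_{v,w in C_i} [A_vw - k_v k_w/(2m)] / (O_v O_w)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())      # sum of degrees equals 2m
    membership = Counter(v for c in cover for v in c)    # O_v for every vertex
    total = 0.0
    for community in cover:
        for v in community:
            for w in community:
                a_vw = 1.0 if w in adj[v] else 0.0
                term = a_vw - len(adj[v]) * len(adj[w]) / two_m
                total += term / (membership[v] * membership[w])
    return total / two_m


def best_cover(adj: Graph, covers: Sequence[Cover]) -> Cover:
    """Pick the candidate cover (dendrogram cut) with the largest EQ."""
    return max(covers, key=lambda c: extended_modularity(adj, c))
```

For two triangles joined by a single edge, the partition into the two triangles gives EQ = 6/7 - 1/2, which equals the ordinary modularity Q, while the cover consisting of a single community gives EQ = 0.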

Like modularity, the extended modularity suffers from a
resolution limit beyond which no modular structure can
be detected even though these modules might have their
own entity. With EAGLE, however, these modules can
still be detected by further applying the algorithm to each
community found until none of them can be divided into
smaller ones. Thus, we obtain a hierarchy of overlapping
communities which reveals the community structure of
the network more completely.

We now analyze the time complexity of the algorithm.
Let n be the number of vertices, s be the number of maximal
cliques in the initial state of the algorithm, and h be
the number of pairs of maximal cliques which are neighbors
(connected by edges or overlapping with each other). We
first consider the first stage of the algorithm. In step 1,
O(n^2) operations are needed to calculate the similarity
between each pair of initial communities. In step 2, we
only consider the pairs of communities which are neighbors.
Each selection costs h operations and each
join costs O(n) operations at most. In total, we carry out a
maximum of s - 1 join operations. Thus the first stage of
the algorithm takes at most O(n^2 + (h + n)s) operations.
As to the second stage, we need to calculate the value of
EQ corresponding to each cover. In our implementation,
we calculate the value of EQ for the initial cover and update
it after each join of two selected communities into
a new one. Each update costs at most O(n^2) operations.
Hence, the second stage of the algorithm takes
at most O(n^2 s) operations. In addition, we need to find
all the maximal cliques in the network. It is widely
believed to be a non-polynomial problem. However, for
real-world networks, finding all the maximal cliques is
easy due to the sparseness of these networks.

Compared to Newman's fast algorithm and the
k-clique algorithm, the algorithm EAGLE is time-consuming.
How to improve the speed of EAGLE is left as
future work.



III. APPLICATIONS 

In this section, we apply the algorithm EAGLE to two
real-world complex networks, the word association network
and the scientific collaboration network. The results
show that EAGLE can discover new knowledge and
insights underlying these networks.

The test data of the two networks are from the
demo of the CFinder [2^. The two networks comprise
7207 and 16662 nodes and 31784 and 22446 edges, respectively.
The average clustering coefficients [16] are approximately
0.15 and 0.19, which indicate that these networks have
significant community structures in general.

The word association network is constructed from the
South Florida Free Association norms list. The original
network is directed and weighted. The weight of a
directed link from one word to another indicates the
frequency with which the people in the survey associated the end
point of the link with its start point. The directed links
are replaced by undirected ones with a weight equal to the
sum of the weights of the corresponding two oppositely
directed links. Furthermore, the links with weight less
than 0.025 are deleted. The scientific collaboration network
is from the co-authorship network of the Los Alamos
e-print archives. Each article in the archive between April
1998 and February 2004 contributes the value 1/(n - 1) to
the weight of the link between every pair of its n authors.
Links with weight less than 1.0 are omitted.
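
The weighting rule for the collaboration network can be illustrated with a short sketch. It is hypothetical in its details: the function name `coauthorship_graph`, the input format (one author list per article) and the toy data in the usage note are assumptions, not the actual preprocessing code or data behind the CFinder demo files.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def coauthorship_graph(papers: Iterable[List[str]], threshold: float = 1.0) -> Dict[str, Set[str]]:
    """Build an unweighted co-authorship graph from per-article author lists.

    Each article with n authors adds 1/(n - 1) to the weight of every author
    pair; links whose total weight is below `threshold` are omitted.
    """
    weights: Dict[Tuple[str, str], float] = defaultdict(float)
    for authors in papers:
        unique = sorted(set(authors))
        n = len(unique)
        if n < 2:
            continue  # single-author articles create no links
        for a, b in combinations(unique, 2):
            weights[(a, b)] += 1.0 / (n - 1)

    adj: Dict[str, Set[str]] = defaultdict(set)
    for (a, b), weight in weights.items():
        if weight >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)
```

With the toy input `[["A", "B"], ["A", "B", "C"], ["B", "C"]]`, the pairs A-B and B-C each reach weight 1.5 and are kept, while A-C has weight 0.5 and is dropped.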

In the word association network, 17 communities in total
are found by our algorithm (see Fig. 3(a), left
panel). Among these communities, 63 of 136 possible
pairs of communities overlap with each other. To investigate
what is correlated to the community structure,
we apply our algorithm to each of these communities
again. The sub-community structure of one community
is given in Fig. 3(a), middle panel. Each of these
sub-communities has a certain correlation with the semantic
meaning of words. For example, most of the words in
the community with size 112 are related to the family of
animals in Africa. This community is explored further
and four communities are found, shown in Fig. 3(a), right
panel. Each community is associated with animals from
the same family, namely rodentia, felidae & primates,
cervidae & caprinae, and equidae, respectively. The details
of one community are also illustrated in Fig. 3(a),
right panel. Two large communities correspond to words
associated with animals from cervidae and caprinae,
respectively. The overlapped word Animal acts as a bridge
between the two communities. Three small communities
comprise peripheral words.
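
The figure of 136 possible pairs quoted above is simply the number of unordered pairs among 17 communities, 17 x 16 / 2. Counting how many pairs of a cover actually overlap can be sketched as below; the function name `overlapping_pairs` is hypothetical and the cover used in the check is a toy example, not the word association data.

```python
from itertools import combinations
from math import comb
from typing import Iterable, Set


def overlapping_pairs(cover: Iterable[Set[int]]) -> int:
    """Count unordered pairs of communities that share at least one vertex."""
    return sum(1 for a, b in combinations(list(cover), 2) if a & b)


# Number of possible pairs among 17 communities.
possible_pairs = comb(17, 2)
```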

Applying our algorithm to the scientific collaboration
network, we obtain 1754 communities in total (see
Fig. 3(b), left panel), with the corresponding high value
of EQ ≈ 0.85. Three large communities contain 23.4%
of all the vertices, while the others are relatively small.
The three large communities correspond closely to subject
subareas: the biggest one mainly to mes-hall and
str-el, the second biggest one to str-el and supr-con, and
the other to stat-mech, dis-nn and soft. We further apply
the algorithm to one community and it is broken down
into 26 sub-communities (depicted in Fig. 3(b), middle
panel). There appears to be a correlation between the
sub-community structure and the regional divisions of the
scientific researchers. For example, most of the members 
of the community with size 166 work in Europe. More 
specific regional information can be obtained when applying
the algorithm to this community. The biggest one
and its sub-community structure are given in Fig. 3(b),
right panel. We can see that the author G. Parisi (who
is well known for having made significant contributions
in different fields of physics) acts as a hub in the community.
Different communities can be associated with his
different fields of interest.

[Figure 3 panel text. a) Word association network. b) Scientific collaboration network:
"Communities in the Los Alamos cond-mat Archive, 16662 vertices, EQ ≈ 0.86, +1751 smaller communities, no community overlap is found";
"For each community, most of its members are from the adjacency region, EQ ≈ 0.79, +19 smaller communities, no community overlap is found";
"11 communities are found (EQ ≈ 0.55). We give the biggest one as an example. (Only the vertices with the degree larger than 3 are depicted.)";
"EQ = 0.26".]

FIG. 3: The hierarchical and overlapping community structure in a) the word association network, and b) the scientific
collaboration network. Each numbered circle denotes a community and the number in the circle denotes its size. Communities
connected by a link overlap with each other. Different communities are rendered in different colors. The overlapping nodes and
edges between communities are colored in red. In addition, the values of the corresponding EQ are also given when breaking
networks (communities) down into communities (sub-communities).

FIG. 4: The hierarchical community structure found by Newman's fast algorithm in the scientific collaboration network. Each
numbered circle denotes a community and the number in the circle denotes its size. Communities connected by a link overlap
with each other. Different communities are rendered in different colors. The overlapping nodes and edges between communities
are colored in red. In addition, the values of the corresponding Q are also given when breaking networks (communities) down
into communities (sub-communities).

FIG. 5: The overlapping community structure around the
node G. Parisi in the scientific collaboration network. Different
communities are rendered in different colors. The overlapping
nodes and edges between communities are colored in
red. Here, k is set to be 4.

Now, we compare the algorithm EAGLE with Newman's
fast algorithm and the k-clique algorithm by applying
them to the scientific collaboration network. Figure 4
shows the hierarchical community structure
found by Newman's fast algorithm. The number of communities
at each level of the hierarchy is almost identical
to that found by the algorithm EAGLE, although the size
of each community is somewhat different. Comparing the
left panel of Fig. 4 with that of Fig. 3(b), one community
disappears. Actually, it is divided into several other
smaller communities, which are not depicted. As to the
right panels, the details of communities are given. The
node G. Parisi, acting as a hub in Fig. 3, only appears in
one community in Fig. 4. The reason is that Newman's
algorithm gives rise to partitions of the network, while the
algorithm EAGLE allows overlaps between communities.
Note that overlap between communities is a very common
phenomenon in real networks and may contribute
to the evolution of communities and the dynamics of
networks.

Figure 5 shows the overlapping community structure
around the node G. Parisi in the scientific collaboration
network. A comparison with Fig. 3 shows that both the algorithm EAGLE
and the k-clique algorithm can find the overlapping community
structure, although the overlapped communities
found by the two algorithms are somewhat different. However,
unlike the k-clique algorithm, the algorithm EAGLE can
also give the hierarchy of these overlapped communities.
The hierarchy of communities is useful for understanding
the community structure of real-world networks.



IV. CONCLUSIONS AND DISCUSSIONS 

In this paper, we propose an algorithm, namely EA- 
GLE, to uncover both the hierarchical and overlapping 
properties of community structure in complex networks. 
This algorithm deals with the set of maximal cliques and 
adopts an agglomerative framework. The effectiveness 
of this algorithm is demonstrated by applications to two 
real-world networks, namely the word association network
and the scientific collaboration network. Results
also show that the algorithm EAGLE provides a possible
way to gain a more complete picture of the community
structure of networks. Note that only unweighted and
undirected networks are considered in this paper. In
future work, EAGLE will be generalized to weighted
and/or directed networks. How to improve the efficiency
of EAGLE will also be considered.

Our method can help to analyze the community struc- 
ture of some very large networks. It can also shed some 
light on understanding the topological and dynamical be- 
havior of some large technological, social and biological 
network systems. 